RNA-Seq Data Analysis ◾ 181
Once R and the edgeR and limma packages are installed, you are ready for the
differential analysis, which can be broken down into the following steps.
5.3.7.1 Data Preparation
For differential analysis, edgeR requires a count data file and a sample info file. We
created the count data file in the previous step, but we still need to create the sample
info file, which describes the design of the study. It can be created manually, as shown
in Table 5.1. The sample info file is tab-delimited, and its first column contains the
unique sample IDs or the BAM file names. Additional columns can contain the conditions
or factors, depending on the study design. For our example data, we can create the
sample info file by running the following bash script from the main project directory:
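cd bam
# List the BAM files and strip the 4-character ".bam" suffix
ls *.bam \
  | rev \
  | cut -c 5- \
  | rev > tmp.txt
# Write the header row
echo -e "sampleid\tcondition\tpatient" \
  > ../features/sampleinfo.txt
# Append one row per sample: ID, condition, patient
awk -F '_' '{print $1 "_" $2 "\t" $1 "\t" $2}' \
  tmp.txt >> ../features/sampleinfo.txt
rm tmp.txt
cd ../features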
This script creates the sample info file from the BAM file names and saves it in the
“features” subdirectory together with the read count data. For your own data, you may
need to modify this script, or you can create the file with other Linux bash commands or
manually.
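To see what such a script produces without touching real data, the following sketch runs the same idea in a scratch directory. The BAM file names (normal_p1.bam and so on) are hypothetical and assume a <condition>_<patient>.bam naming pattern, which is what the awk command above expects. As a substitution for the rev | cut | rev idiom, it uses basename and shell parameter expansion to split the names; treat it as one possible variant rather than the required approach.

```shell
#!/usr/bin/env bash
# Sketch only: directory layout and file names are hypothetical placeholders.
set -e

workdir=$(mktemp -d)                      # scratch project directory
mkdir -p "$workdir/bam" "$workdir/features"

# Create empty stand-in BAM files named <condition>_<patient>.bam
for name in normal_p1 tumor_p1 normal_p2 tumor_p2; do
  : > "$workdir/bam/$name.bam"
done

out="$workdir/features/sampleinfo.txt"
printf 'sampleid\tcondition\tpatient\n' > "$out"   # header row

for f in "$workdir"/bam/*.bam; do
  id=$(basename "$f" .bam)     # strip the directory and the .bam suffix
  condition=${id%%_*}          # text before the first underscore
  patient=${id#*_}             # text after the first underscore
  printf '%s\t%s\t%s\n' "$id" "$condition" "$patient" >> "$out"
done

cat "$out"
```

The resulting file has a header line followed by one tab-delimited row per BAM file. If your file names contain additional underscores or a different suffix, the splitting logic needs to be adapted accordingly.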
Then open R, set the “features” directory as the working directory, and load both the
limma and edgeR packages.
library(limma)
library(edgeR)
Load both the count data file and the sample info file into the R session as data frames.
seqdata <- read.delim("htcount2.txt", stringsAsFactors=FALSE)
sampleinfo <- read.delim("sampleinfo.txt", stringsAsFactors=FALSE)
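If the sample info file was edited by hand, a quick check from the shell can confirm that it is really tab-delimited and that the sample IDs in the first column are unique, as described earlier. The sketch below builds a small hypothetical file (its contents are made up for illustration) and counts rows with the wrong number of fields and duplicated IDs; to check your own file, point the info variable at it instead.

```shell
#!/usr/bin/env bash
# Sketch only: the file contents below are hypothetical placeholders.
set -e

info=$(mktemp)
printf 'sampleid\tcondition\tpatient\n'  > "$info"
printf 'normal_p1\tnormal\tp1\n'        >> "$info"
printf 'tumor_p1\ttumor\tp1\n'          >> "$info"

# Count lines that do not have exactly three tab-separated fields.
bad_rows=$(awk -F '\t' 'NF != 3 { n++ } END { print n + 0 }' "$info")

# Count sample IDs (column 1, header skipped) that occur more than once.
dup_ids=$(awk -F '\t' '
  NR > 1 { seen[$1]++ }
  END { n = 0; for (id in seen) if (seen[id] > 1) n++; print n }
' "$info")

echo "rows with wrong field count: $bad_rows"
echo "duplicated sample IDs: $dup_ids"
```

Both counts should be zero for a well-formed file; a nonzero value usually points to spaces used in place of tabs or to a repeated sample ID.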
Run the following command to display the first rows of the count data frame:
head(seqdata)
You will notice that the first two columns are the gene symbol and the transcript IDs. The
other six columns contain the read counts. In the next step, we need to separate the count